Part II - Dataset Exploration: Communicate Data Findings (Ford GoBike System Data)¶

by Samy Haliem¶

Investigation Overview¶

The start date/time and duration of each trip can be used to understand how long a trip typically takes and when it is most likely to occur. The user information, such as user type, gender, and age, can be used to identify the main target customer groups. By summarizing the bike usage data for different groups of riders, we can see if there are any special patterns associated with a specific group.

For example, we might find that subscribers tend to take longer trips than customers, or that men tend to take more trips than women. We might also find that younger people tend to take shorter trips than older people.

This information can be used to improve the bike-sharing service by targeting different groups of riders with different marketing messages. For example, we might target subscribers with messages about longer trips, or we might target women with messages about safety.

Here are some specific questions you could ask:

What is the average trip duration for subscribers? For customers?
What are the most popular times of day for trips? On weekdays? On weekends?
What are the most popular start and end stations?
Do men or women take more trips?
Do younger or older people take more trips?
Do subscribers or customers take longer trips?

By asking these questions and exploring the data, you can learn more about how people are using the bike-sharing service and how it can be improved.

Dataset Overview¶

The original combined data set contains 183,412 individual trip records. There are 16 variables in this data set, which can be divided into three major categories:

Trip duration: This category includes the duration_sec, start_time, and end_time variables. These variables give the length of each trip and the times at which it started and ended.

Station information: This category includes the start_station_id, start_station_name, start_station_latitude, start_station_longitude, end_station_id, end_station_name, end_station_latitude, and end_station_longitude variables. These variables describe the start and end stations of each trip.

Member information (anonymized): This category includes the bike_id, user_type, member_birth_year, member_gender, and bike_share_for_all_trip variables. These variables give the bike used for each trip, the user type (Subscriber, i.e. member, or Customer, i.e. casual rider), the member's birth year, the member's gender, and whether the trip was taken under the Bike Share for All program.

In addition to the original variables, the following derived features were created to assist with exploration and analysis:

Trip information: The duration_min variable was created by dividing duration_sec by 60. The s_day and s_hour variables were created by extracting the day of the week and the hour of the day from start_time.

Member: The member_age variable was calculated from the member's birth year.
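
The derived features described above could be built roughly as follows. This is a minimal sketch, not the notebook's actual cleaning code: the three sample rows are made up for illustration, and the reference year of 2019 used for member_age is an assumption.

```python
import pandas as pd

# Tiny made-up sample using the raw column names (illustration only).
raw = pd.DataFrame({
    "duration_sec": [600, 1500, 90],
    "start_time": ["2019-02-28 17:32:10", "2019-02-23 08:05:00", "2019-02-04 23:59:59"],
    "member_birth_year": [1984.0, 1990.0, None],
})

# Assumption: ages are measured relative to 2019.
REFERENCE_YEAR = 2019

start = pd.to_datetime(raw["start_time"])
derived = raw.assign(
    duration_min=raw["duration_sec"] / 60,                # seconds -> minutes
    s_day=start.dt.day_name(),                            # e.g. "Thursday"
    s_hour=start.dt.hour,                                 # 0-23
    member_age=REFERENCE_YEAR - raw["member_birth_year"], # stays NaN if birth year is missing
)
print(derived[["duration_min", "s_day", "s_hour", "member_age"]])
```

Rows with a missing birth year keep a missing age rather than being dropped, so the trip-level features remain usable for every record.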

In [1]:
# import all packages and set plots to be embedded inline
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
%matplotlib inline 

sns.set()
# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")
In [3]:
# load the cleaned dataset into a pandas DataFrame
bike_19 = pd.read_csv("cleand_fordgobike-tripdata.csv")

What is the average trip time?¶

The majority of trips took between 5 and 10 minutes.¶
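
A claim like the one above can also be checked numerically instead of being read off the histogram. The sketch below uses made-up durations, so its numbers say nothing about the real data; on the notebook's data the same lines would be applied to bike_19['duration_min'].

```python
import pandas as pd

# Made-up durations in minutes, for illustration only (not taken from the dataset).
trips = pd.DataFrame({"duration_min": [3.0, 6.5, 7.2, 8.0, 9.9, 12.4, 25.0, 48.0]})

# Share of trips whose duration falls in the 5-10 minute band (both ends inclusive).
in_band = trips["duration_min"].between(5, 10)
share_5_to_10 = in_band.mean()

# The median is less sensitive than the mean to the long right tail of trip durations.
median_min = trips["duration_min"].median()
mean_min = trips["duration_min"].mean()

print(f"share in 5-10 min: {share_5_to_10:.0%}, median: {median_min:.2f}, mean: {mean_min:.2f}")
```

Reporting the median alongside the mean is a deliberate choice here: a few very long rentals can pull the mean well above what a typical trip looks like.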

In [4]:
fig = px.histogram(
    data_frame=bike_19,
    x='duration_min',
    title='Distribution of Trip Durations'
)

fig.update_layout(
    xaxis_title='Duration (minutes)',
    yaxis_title='Count',
    xaxis_range=[0, 70]
)

fig.show()

At which hours and on which days of the week do most trips start?¶

The number of bike trips peaks twice a day, during the morning rush hour (8am-9am) and the evening rush hour (5pm-6pm). These are the times when people are most likely to be commuting to and from work or school. The fact that the majority of rides happen on weekdays also suggests that bike sharing is primarily used for commuting.¶
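
The two daily peaks can be located programmatically as well as visually. This is a small sketch on made-up start hours; with the notebook's data, bike_19['s_hour'] would take the place of the sample column.

```python
import pandas as pd

# Made-up start hours (0-23), for illustration only.
trips = pd.DataFrame({"s_hour": [8, 8, 8, 9, 12, 17, 17, 17, 17, 18, 22]})

# Count trips per start hour, keeping every hour of the day (hours with no trips become 0).
trips_per_hour = (
    trips["s_hour"].value_counts().reindex(range(24), fill_value=0).sort_index()
)

# The two busiest hours, listed in chronological order.
peak_hours = sorted(trips_per_hour.nlargest(2).index.tolist())
print(peak_hours)
```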

In [5]:
# Helper that sets the title and axis labels of the current plot
def plot_label(title, x_label, y_label):
    plt.title(title)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.xticks(rotation=45)

# Explore how trips are distributed over the hours of the day
color_base = sns.color_palette()[0]
sns.countplot(data=bike_19, x='s_hour', color=color_base)
plot_label("Trip Start Hour Of The Day", "Hour Of Day", "Count")
plt.show()
In [6]:
# Explore how trips are distributed over the days of the week
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
sns.countplot(data=bike_19, x='s_day', color=color_base, order=days)
plot_label("Trip Start Day Of The Week", "Day Of The Week", "Count")

How does the trip duration distribution differ between customers and subscribers?¶

Subscribers' trip durations are much more tightly distributed than customers' and are concentrated at the short end. Subscribers seem to ride with a specific purpose in mind, whereas customers' trips vary more and generally last longer.¶
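
The "narrower" distribution can be quantified with quartiles and the interquartile range (IQR) per user type. This is a sketch on made-up numbers that reuses the 60-minute cap from the violin plot; it is not output from the actual dataset.

```python
import pandas as pd

# Made-up sample, for illustration only.
trips = pd.DataFrame({
    "user_type": ["Subscriber"] * 5 + ["Customer"] * 5,
    "duration_min": [6, 8, 9, 10, 12, 5, 12, 20, 30, 55],
})

# Same filter as the violin plot: ignore trips longer than 60 minutes.
capped = trips.query("duration_min <= 60")

# Quartiles per user type; the IQR (75th minus 25th percentile) measures the spread.
quartiles = (
    capped.groupby("user_type")["duration_min"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles["iqr"] = quartiles[0.75] - quartiles[0.25]
print(quartiles)
```

A smaller IQR for one group means the middle half of its trips falls in a tighter range of durations, which is what a slimmer violin shows graphically.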

In [7]:
# inner='quartile' draws the quartiles of each distribution inside the violins
sns.violinplot(data=bike_19.query('duration_min <= 60'), x='user_type', y='duration_min', color=color_base, inner='quartile')
plot_label("Trip Duration Distribution by User Type", 'User Type', 'Trip Duration in Minutes')

Average trip duration by day of the week¶

Trips are much shorter Monday through Friday than on weekends. This suggests stable, efficient use of the sharing system on workdays and more casual, flexible use on weekends.¶
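
The weekday versus weekend comparison can be reduced to two numbers by grouping the days before averaging. This is a sketch with made-up rows; the weekday/weekend labels are names introduced here for illustration.

```python
import pandas as pd

# Made-up sample, for illustration only.
trips = pd.DataFrame({
    "s_day": ["Monday", "Tuesday", "Friday", "Saturday", "Sunday", "Sunday"],
    "duration_min": [8.0, 10.0, 9.0, 14.0, 16.0, 18.0],
})

# Flag weekend trips, then compare the mean duration of the two groups.
is_weekend = trips["s_day"].isin(["Saturday", "Sunday"])
part_of_week = is_weekend.map({True: "weekend", False: "weekday"})
mean_by_part = trips.groupby(part_of_week)["duration_min"].mean()
print(mean_by_part)
```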

In [8]:
sns.barplot(data=bike_19, 
            x="s_day", 
            y="duration_min",
            order=days,
            color=color_base)

plot_label("Trip Duration by Day of Week","Day of Week",'Average Trip Duration (minutes)')

Average member age by day of the week¶

Riders who rent bikes Monday through Friday are slightly older than those who ride on weekends.¶

In [9]:
sns.barplot(data=bike_19, x="s_day", y="member_age", color=color_base, order=days)
plot_label("Average Member Age by Day of the Week", "Day Of The Week", "Age")

Does hourly usage depend on user type?¶

Subscriber usage clearly peaks during the typical rush hours, when people travel to work in the morning and head home in the late afternoon, which further supports the view that subscribers mainly ride to commute. Customers show no such pattern: they tend to ride most in the afternoon or early evening, which suggests a different purpose.¶
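
Because the two user types differ greatly in size, raw counts can hide the smaller group's pattern. One option is to compare each group's hourly share of its own trips, sketched here with pd.crosstab on made-up data.

```python
import pandas as pd

# Made-up sample, for illustration only.
trips = pd.DataFrame({
    "s_hour": [8, 8, 8, 17, 17, 12, 12, 14, 15, 15],
    "user_type": ["Subscriber"] * 6 + ["Customer"] * 4,
})

# normalize="columns" turns counts into each user type's share of trips per hour,
# so every column sums to 1 regardless of how many trips that group made.
hourly_share = pd.crosstab(trips["s_hour"], trips["user_type"], normalize="columns")

# Busiest start hour for each user type.
busiest_hour = hourly_share.idxmax()
print(hourly_share.round(2))
print(busiest_hour)
```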

In [10]:
sns.countplot(data=bike_19,x='s_hour',hue='user_type')
plot_label("Hour of the day vs user_type","Hour Of Day","Count");

Does day-of-week usage depend on user type?¶

Subscribers took far more trips than casual customers overall. The drop in subscriber volume on weekends indicates that subscribers primarily ride to commute on workdays. Conversely, the slight increase in customer trips on weekends suggests that customers ride more for leisure and sightseeing.¶
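
The weekend drop for subscribers and the weekend rise for customers can be summarized as each group's share of trips taken on Saturday or Sunday. This is a minimal sketch on made-up rows, not a result from the dataset.

```python
import pandas as pd

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Made-up sample, for illustration only.
trips = pd.DataFrame({
    "s_day": ["Monday", "Tuesday", "Wednesday", "Saturday", "Friday",
              "Monday", "Saturday", "Sunday"],
    "user_type": ["Subscriber"] * 5 + ["Customer"] * 3,
})

# Trips per day and user type, with rows in calendar order and missing days filled with 0.
counts = (
    trips.groupby(["s_day", "user_type"]).size()
    .unstack(fill_value=0)
    .reindex(days, fill_value=0)
)

# Share of each user type's trips that start on a weekend day.
weekend_share = counts.loc[["Saturday", "Sunday"]].sum() / counts.sum()
print(counts)
print(weekend_share.round(2))
```

If trips were spread evenly over the week, the weekend share would be 2/7, or about 0.29, which gives a simple baseline for judging each group.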

In [11]:
sns.countplot(data=bike_19, x='s_day', hue='user_type', order=days)
# the x-axis shows the day of the week; user type is encoded by the bar color (hue)
plot_label("Day of the Week vs User Type", "Day Of The Week", "Count")
In [ ]:
!jupyter nbconvert Part_2_Ford_GoBike_System_Data.ipynb --to slides --post serve --no-input --no-prompt